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Abstract 

To interpolate a supersparse polynomial with integer coefficients, two 
alternative approaches are the Prony-based "big prime" technique, which 
acts over a single large finite field, or the more recently-proposed "small 
primes" technique, which reduces the unknown sparse polynomial to many 
low-degree dense polynomials. While the latter technique has not yet reached 
the same theoretical efficiency as Prony-based methods, it has an obvious 
potential for parallelization. We present a heuristic "small primes" interpo¬ 
lation algorithm and report on a low-level C implementation using FLINT 
and MPl. 


1 Introduction 

Given a way to evaluate or sample an unknown function or procedure, inter¬ 
polation is the fundamental and important problem of recovering a formula 
which accurafely and complefely describes fhaf unknown funcfion. As discov¬ 
ering an arbifrary unknown funcfion from a finife sef of evaluafions wifh any 
reliabilify would be impossible, some consfrainfs on fhe size and form of fhe 
outpuf are inevifably required. 

Here we consider fhe problem of sparse polynomial interpolation, in which 
we are guaranfeed fhaf fhe unknown funcfion is a mulfivariafe pol 5 momial 
wifh bounded degree. Sparse inferpolafion algorifhms dafe fo fhe 18fh cenfury, 
buf have been fhe focus of considerable recenf work in numeric and symbolic 
compufafion, wifh applicafions ranging from power consumpfion in medical 
devices, fo reducing infermediafe expression swell in mafhemafical compufa- 
fions [Cuyf and Lee, 2008, Kalfofen, 2010, Kalfofen ef al., 2011, Boyer ef al., 
2012 ]. 

Specifically, fhis work focuses on algorifhms fo infetpolafe an unknown 
supersparse pol 5 momial wifh infeger coefficienfs, which make efficienf use of 
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modern parallel computing hardware. Focusing on the "supersparse" (a.k.a. 
"lacunary") case means that our running time will be in terms of the number 
of variables, number of nonzero ferms, and fhe logarifhm of fhe outpuf degree. 

The firsf algorifhms fo solve fhis problem in pol 5 momial-fime were based on 
fhe exponenfial sums fechnique of Prony, and can efficienfly solve fhe infeger 
problem by working in a large finife field modulo a single "big prime." A 
number of fheorefical improvemenfs and pracfical implemenfafions have been 
made in fhis vein, including work on fasf parallel implemenfafions [Javadi and 
Monagan, 2010, van der Hoeven and Lecerf, 2015]. 

We consider anofher fype of approach, whereby fhe unknown sparse infe¬ 
ger pol 5 momial is reduced in degree modulo many small primes. This fech¬ 
nique, firsf used by Garg and Schosf fo avoid fhe need for discrefe logarifhm 
compufafions in arbifrary finife fields, can also be applied fo fhe infeger poly¬ 
nomial case. While if carmof yef match the theoretical efficiency of fhe big 
primes algorifhms, we will show fhaf fhe "small primes" mefhod is very ef- 
fecfively parallelized. Furfhermore, we develop a pracfical heurisfic version 
of fhis mefhod which reduces furfher fhe size and number of primes required 
based on experimenfal resulfs rafher fhan on fhe fheorefical worsf-case bounds. 

1.1 Related work 

While if would be impossible fo lisf all of fhe relafed work on sparse inferpo- 
lation, we will mention some of fhe mosf recenf resulfs which are mosf closely 
cormecfed fo fhe currenf sfudy and which may provide fhe reader wifh useful 
background. 

The now-classical approach fo sparse inferpolafion is variously credifed fo 
Prony, Blahuf [Blahuf, 1979], or Ben-Or and Tiwari ]Ben-Or and Tiwari, 1988]. 
Only fhe lasf of fhese considered explicifly fhe case of infeger coefficienfs, buf 
all share fhe key properfy of requiring fhe minimal number O (T) of evaluafions 
in order fo recover a pol 5 momial wifh T nonzero ferms. 

We refer fo fhese approaches as "big prime" fechniques, as fhe more mod¬ 
em varianfs ]Kalfofen and Yagafi, 1989, Kalfofen, 2010] adapf fo fhe case of 
infeger coefficienfs by choosing carefully a single large modulus, fhen perform 
fhe inferpolafion over a finife field in order fo avoid exponential growth in the 
bit-length of evaluafions. 

The approach which we fake in fhis work is based on fhaf of Garg and 
Schosf [Garg and Schosf, 2009], who developed fhe firsf pol 5 momial-fime su¬ 
persparse inferpolafion algorifhm over an arbifrary finife field. By reducing 
fhe unknown pol 5 momial modulo {zP — 1), fhe full coefficienfs are discovered 
immediafely, buf fhe exponenfs are only discovered affer repealing for multi¬ 
ple values of p. This "small primes" approach, described in more defails below, 
has fhe considerable advanfage of relying only on low-degree, dense pol 5 mo- 
mial arithmefic. 

There are, of course, ofher sparse inferpolafion mefhods which do nof fif 
nicely info fhis big/small prime characferizafion. Mosf nofable are ZippeTs 
algorithm and hybrid variants of if [Zippel, 1990, Kalfofen and Lee, 2003], fhe 
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symbolic-numeric method of [Mansour, 1995], and the Newton-Hensel lifting 
approach of [Avendano et al., 2006]. 

We point out some recent work on efficient implementations which are 
of particular interest to the current study. In [Javadi and Monagan, 2010], a 
new variant of the big prime approach is developed which can be performed 
variable by variable, in parallel. More recently, ]van der Hoeven and Lecerf, 
2015] investigated a number of tricks and techniques towards practical, effi¬ 
cient sparse interpolation, and posed some new benchmark problems. Their 
methods are also based on the big prime approach; to our knowledge there 
has been no reported implementation work on the small primes technique 
other than the numerical interpolation code reported in ]Giesbrecht and Roche, 
2011]. 

1.2 Our contributions 

Suppose / G Z[xi, X 2 , ■ ■ ■ Xn\ is an unknown polynomial in n variables, with 
partial degrees less than D in each variable, at most T nonzero terms, and co¬ 
efficients less than H in absolute value. Then / can be written as 

T 

f _ \ ^ y^in 

J ^ ^2 ' • • Xyi , 

where each exponent eij < D and each |c;| < H. 

Given a way to evaluate f{9\,... ,6n) mod q, for any modulus q ^ Z and 
n-tuple of evaluation points {6i, ..., 9n) G {Z/qZ)", the sparse integer polyno¬ 
mial interpolation problem is to determine the coefficients c, and exponent tuples 
(e;i,..., e,„), for each 1 < f < T. 

We will actually consider a slight relaxation of this problem, wherein eval¬ 
uations are of the form 

{q,p,di,...,d„) /(z‘^i,...,z‘^") mod (z^ - 1) G iZ/qZ)[z]. 

That is, the n variables are replaced by a single indeterminate z, all coefficients 
are reduced modulo q, and the exponents are reduced modulo p. The reduction 
modulo (zP — 1) is possible without affecting the overall complexity whenever 
the unknown pol 5 momial / is given as a straight-line program or algebraic 
circuit, or if the prime q is chosen so that Z / qZ has a pth root of unity. 

An algorithm for this problem is said to handle the supersparse case if it 
requires a number of evaluations and rurming time which are polynomial in n, 
T, log D, and log H. This corresponds to the size of the sparse representation of 
/ as a list of coefficient-exponent tuples, which requires 0(nT log D -|- Tlog H) 
bits in memory. 

We present a randomized algorithm for the sparse integer pol 5 momial in¬ 
terpolation problem, derived from the existing literature, whose rurming time* 

* Here and throughout we use the soft-oh notation in order to simplify the stated running times: 
a running time is said to be 0{(p) for some function cp if and only if it is 0{cp(log 


3 






for provable correctness is 


O ( (l + in log d) nT log D (log D + log H)) , (1) 

where m > 1 is the number of parallel processors available for the task. Fur¬ 
thermore, we demonstrate a heuristic variant on this algorithm, which works 
well in our experimental testing, and reduces the rurming time further to 

o(nlogDlogH+(l + inlogO) TlogH) (2) 

in the typical case that n and log D are both 0(T log H). 

We have implemented this heuristic approach using the C library FLINT 
for dense pol 5 momial arithmetic and MPI for parallelization. Our experiments 
demonstrate the smallest effective settings for the parameters in our heuristic 
approach. With those parameters, we show that the heuristic method is com¬ 
petitive with the state of the art in the single-processor setting, and that its 
rurming time scales well with increasing numbers of parallel processors. 

Specifically, our contributions are: 

1. A sparse interpolation algorithm whose potential parallel speedup is O (n log D), 
compared with the 0(n) parallel speedup that has been shown in previ¬ 
ous work for other sparse interpolation algorithms. 

2. A heuristic variant of our algorithm, which is demonstrated to be effec¬ 
tive on our (limited) random experiments, that brings the rurming time 
complexity of the small primes approach to be competitive with that of 
the big prime approach. 

3. An efficient C implementation of our interpolation algorithm which demon¬ 
strates its competitiveness on a standard benchmark problem. 

1.3 Organization of the paper 

We first outline the two main classes of existing approaches to supersparse 
integer pol 5 momial interpolation, which we call the "big prime" and the "small 
primes" methods, in Section 2. We then present our own heuristic method 
in Section 3, which is based on previously-known "small primes" algorithms, 
but goes beyond theoretical worst-case bounds on the sizes needed in order to 
further improve the efficiency. 

Section 4 reports the details of our parallel implementation of this heuristic 
method, and Section 5 presents the preliminary experimental results which 
demonstrate the efficacy of this approach. Finally, we state some conclusions 
and directions for further investigation in Section 6. 
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2 Existing algorithms for supersparse interpolation 

2.1 "Big prime" methods 

The original sparse interpolation algorithm of [Ben-Or and Tiwari, 1988] used 
evaluations at powers of fhe firsf n prime numbers along wifh Prony's mefhod 
fo deferminisfically recover fhe nonzero infeger coefficienfs and exponenfs of 
an unknown pol 5 momial. This cannof be considered a "supersparse" algo- 
rifhm, as if performs fhe evaluafions over fhe infegers direcfly and requires 
working with integers with more than D bits. 

However, it was soon recognized that, by choosing a single large prime 
p > D”, with p — 1 smooth^ so as to facilitate discrete logarithms, a supersparse 
integer pol 5 momial could be interpolated efficiently modulo p [Kaltofen et al., 
1990, Kaltofen, 2010]. The basic sfeps of fhis approach are as follows: 

1. Choose prime p so fhaf (p — 1) has a large "smoofh" factor greater fhan 
D" and lef 0 be a primifive elemenf modulo p. 

2. For i = 0,1,2,..., 2T — 1, evaluate 

z;,- mod p. 

3. Compufe fhe minimum pol 5 momial T e Zp[z] of fhe sequence (Ui),>o 
wifh fhe Berlekamp-Massey algorifhm. 

4. Factor F over Zp[z\; each roof of F can be written as henD" 

corresponding fo a single ferm ■ ■ ■ Xn” of /. 

5. Compufe T discrete logarifhms of fhe roofs, and fhen fhe D-adic ex¬ 
pansion of each one, fo discover fhe acfual exponenfs {en,. for 

1 < i < T. 

6. Once fhe exponenfs are known, fhe coefficienfs can be computed from 
fhe evaluafions U; by solving a fransposed Vandermonde system of di¬ 
mension T. 

The two sfeps which can be frivially parallelized are fhe evaluafions and 
discrete log compufafions on sfeps 2 and 5. However, fhe dominating cosf in 
fhe complexify is factoring a pol 5 momial over Zp[x] on step 4. This is also con¬ 
firmed fo be fhe dominating cosf in pracfice by ]van der Hoeven and Lecerf, 
2015], and if is nof clear how fo efficienfly parallelize fhe factorization. Us¬ 
ing fhe fasfesf known algorifhms, fhe running time of fhis sfep is 0(T log^ p) 
]Grenef ef al., 2015], which is af leasf O(n^Tlog^D) bif operations from fhe 
condition p > D”. 

In the description above, the evaluations at powers of 9 on sfep 2 amounf 
fo a Kronecker subsfifufion from mulfivariafe fo univariafe. Of note is fhe al¬ 
gorifhm of [Javadi and Monagan, 2010], which uses a differenf approach fhan 

^ An integer is said to be smooth if it has only small prime factors. 
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Kronecker substitution in order to work one variable at a time, gaining a po¬ 
tential n-fold parallel speedup. 

2.2 "Small primes" methods 

The algorithm described above works the same over an arbitrary finite field, 
excepf fhaf fhe discrefe logarifhms required in sfep 5 cannof be performed in 
pol 5 momial-fime in general. This difficulfy was firsf overcome in [Garg and 
Schosf, 2009], where fhe idea is as follows: 

1. Choose a series of small primes pi, p 2 / • ■ ■ 

2. Apply fhe Kronecker subsfifufion and for each i = 1,2,... compufe /, = 
/(z,z°,... ,z^" ^)mod (xP'—1). 

3. Each /, of maximal sparsify confains all fhe coefficienfs of /, and all fhe 
exponenfs modulo p,. Collecf sufficienfly many modular images of fhe 
exponenfs in order fo recover fhe full exponenfs over Z. 

4. Recover fhe mulfivariafe exponenfs by D-adic expansion of each univari- 
afe exponenf, and use any /, of maximal sparsify fo discover fhe coeffi- 
cienf of each ferm. 

There are fwo significanf challenges of fhis approach. The firsf is fhaf, in 
reducing fhe pol 5 momial modulo xP' — 1, if is possible fhaf some exponenfs are 
equivalenf modulo p,, and fhen mulfiple ferms in fhe original pol 5 momial will 
collide fo form a single ferm in /,-. By choosing random primes whose values 
are roughly 0(T^ log D), fhe probabilify of encounfering any collisions can be 
made arbifrarily small. 

The second challenge is how fo correlafe fhe exponenfs from differenf /,'s 
in sfep 3, in order fo recover fhe full exponenfs via Chinese remaindering. The 
approach of [Garg and Schosf, 2009] was fo compufe an auxiliary pol 5 momial 
whose roofs are fhe unknown exponenfs; however fhis increases fhe overall 
running fime fo 0(n^T^ log^ D log H). 

The fechnique of diversification in [Giesbrechf and Roche, 2011] is anofher 
randomizafion which chooses a random elemenf a and inferpolafes /(xz) in- 
sfead of /(z) ifself. Wifh high probabilify, fhe "diversified" pol 5 momial /(xz) 
has disfincf coefficienfs, which can fhen be used fo correlafe fhe exponenfs in 
differenf ffs. This avoids fhe factoring sfep and reduces fhe complexify by a 
factor of T. 

Subsequenfly, and separately, [Arnold ef al., 2013] showed how fo allow 
fhe magnifude of each p; fo decrease by a facfor of T, by allowing some con- 
sfanf fraction of fhe ferms in each /; fo collide. These separafe approaches are 
combined and furfher improved in [Arnold ef al., 2015], which in fhe typi¬ 
cal case that n <C log D ^ log H, brings the theoretical complexity down to 
0{nTlog^ DlogH). Observe that this is competitive with the "big primes" al¬ 
gorithm, but could be slower by up to a factor of log H. 
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The algorithm we describe next first shows how to effectively parallelize 
this small primes approach, and then further reduces the complexity through 
a heuristic argument. After the gains from parallelizafion or from fhe heuris¬ 
tic improvemenf, fhe complexify of our algorifhm will be less fhan fhe "big 
primes" approach as well. 


3 Our parallel small primes algorithm 

Our algorifhm, which is defailed in procedure Sparseinferp, is based on fhe 
reducfion idea in [Garg and Schosf, 2009], wifh fhe diversificafion mefhod in- 
froduced by [Giesbrechf and Roche, 2011] and fhe parfial collisions handling of 
[Arnold ef al., 2013]. If depends crucially on fhe paramefers k and i, which in 
fheory grow as 0(n log D) and 0(1), respectively, buf according fo our heuris¬ 
tic mefhod can bofh be freafed as consfanfs. 

3.1 Algorithm overview 

The main idea is firsf fo choose a prime q large enough fo recover fhe coeffi- 
cienfs, and fhen fo apply fhe Kronecker subsfifufion so fhaf we are really infer- 
polafing a univariafe pol 5 momial g{z) wifh coefficienfs in Z/ qZ: 

g = /(az, (az)°,...,(az)°” = ^C/Z^' € {Z/qZ)[z\. 

i=l 

Each ferm ■ ■ ■ xfl' of fhe original polynomial / maps uniquely fo a ferm 
wifh exponenf E, = ei+ e 2 D and coefficienf c; = mod q 

ing. 

The paramefer Q confrols fhe size of fhe prime q, and so should always 
be sef larger fhan 2H in order fo recover fhe full precision of fhe coefficienfs. 
However, if may be necessary fo sef Q even larger fhan fhis when fhe heighf 
bound H is very small. We commenf fhaf choosing a random prime q (rafher 
fhan one wifh special properties, as in fhe "big primes" algorifhm) is imporfanf 
for fhe probabilistic analysis below. 

The evaluation phase of fhe algorifhm compufes fhe polynomial g mod¬ 
ulo (zP — 1) using dense arifhmefic, for many small primes p. The defails of 
how fhis evaluation is performed will depend on fhe parficular applicafion. In 
our mulfivariafe multiplication applicafion below, fhe exponenfs of fhe original 
multiplicands are reduced modulo p, followed by dense pol 5 momial arifhmefic 
in fhe ring of polynomials modulo — 1. More generally, if fhe unknown 
polynomial / is given as a sfraighf-line program or arifhmefic circuif, each op¬ 
eration in fhe circuif can be compufed over fhaf ring (Z/ qZ) [z] / {zf — 1), as in 
[Garg and Schosf, 2009]. In our complexify analysis below, we assume a cosf 
of 0{plogq) for each evaluation on fhis sfep, according fo fhe cosf of dense 
degree-p pol 5 momial arifhmefic over Z/ qZ. 
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The next phase of the algorithm is to gather the images of each term, discard 
those which appear infrequently (and thus were resulting from collisions in 
the reduction modulo zP — 1), and use Chinese remaindering to recover the 
exponents of each nonzero term in /. 

Observe that the list L can be implemented in any convenient way accord¬ 
ing to the details of the parallel implementation; it serves only as an unordered 
accumulator of coefficient-exponent-prime tuples. The output pol 5 momial / 
could also be considered as an unordered accumulator, followed by another 
parallel sort before the final return statement. 

3.2 Parallel complexity analysis 

We state the parameterized rurming time of the algorithm as follows: 

Theorem 1. Given bounds T, D on the sparsity and degree of an unknown polynomial 
f e Z[xi,..., Xn], and parameters k,l,Q, Algorithm Sparseinterp has worst-case 
running time 


O ^ log ^ n log D log Q + kT log Q 

+ (m +logD -hfcTlogQ) y 

Proof. The computation of y at the beginning incurs the O(logt) cost in the 
complexity, which for our ultimate choices of £ will never actually dominate 
the complexity. 

The first loop executes \y./m'\ G 0(1 -|- ^t'nlogD) times in each thread. 
Computing the powers of D modulo p; on Step 9 requires 0(log D -\-n log p, ) 
bit operations. The subsequent evaluations using dense arithmetic on Step 10 
will usually dominate the complexity of the entire algorithm, as each costs 
0{pi log q), which is 0{kT log Q). (The addition of this term makes the log p, 
factor in the cost of Step 9 become 0(1).) 

Both parallel sorts are on lists of size at most pT, which means their cost of 
0{^£nTlog D) does not dominate the complexity. 

The final for loop executes 0(1 -|- ^pT) times in each thread, but the nested 
if statement can only be triggered 0(T) times overall. Within it, the most ex¬ 
pensive step is computing each mod q, requiring 0{n log D log Q) bit op¬ 
erations. This contributes only O(MlogDlogQ) to the overall complexity, as 
the term ^Tn log D log Q is already dominated by the parallel cost of Step 10. 

□ 

The key feature of procedure Sparseinterp is that its potential parallel speedup, 
from the previous theorem, is a factor of 0{£n log D), depending on the num¬ 
ber of parallel processors m that are available. This exceeds the 0(n) parallel 
speedup of previous approaches, and means that our algorithm should scale 
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better to a large number of processors when the number of variables and / or 
degree are sufficienfly large. 

3.3 Correctness and probability analysis 

We firsf use prior work to prove the bounds necessary to ensure correctness 
with provably high probability. 

Theorem 2. If the parameters k, i, Q and bounds D, T, H satisfy 

Q>max(2H,|(£nTlgD)2D), 
k > max(21, \20nlnD]),and 

e>2, 

then with probability at least 1/2, procedure Sparseinterp correctly computes the coef¬ 
ficients and exponents off e Z[xi ,.. .,Xn]. 

Proof. The condition Q > 2H guarantees that positive or negative coefficients 
whose absolute value is at most H will all be distinct modulo q, for any prime 
q ^ Q Sis chosen in fhe algorifhm. 

Because A = kT, Lemma 8 from [Arnold ef al., 2013] fells us fhaf, for each 
Pi, fhe probabilify fhat more fhan T/3 ferms collide modulo — 1 is af mosf 
1/4. Since fhe mosf number of ferms fhaf could collide is T, fhis means fhe 
expected number of collisions, for each p,, is af mosf T/2. 

The fofal number of nonzero coefficienfs in all pol 5 momials /, examined on 
sfep 13 is af mosf yT < ^nT Ig D. Many of fhese correspond fo single ferms in / 
ifself, buf some will be collisions of ferms in /. Using fhe reasoning of Lemma 
4.1 in [Arnold ef al., 2014], and fhe proof of Theorem 3.1 from [Giesbrechf and 
Roche, 2011], every coefficienf of a single ferm in /, or a collision fhaf appears 
in any of fhe f/s, is disfincf modulo q, due fo fhe bound on Q and because 

We conclude fhaf, wifh probabilify af leasf 1/2, each ferm in / appears un¬ 
collided in af leasf y/2 of fhe pol 5 momials /,, and furfhermore fhaf fhese co¬ 
efficienfs are all disfincf and are fhe only ones fhaf repeat in the list L. This 
guarantees that the correct coefficients and exponents are recovered in the fi¬ 
nal loop, and fhe algorifhm oufpufs fhe correcf interpolafed pol 5 momial. □ 

Applying fhe bounds in Theorem 2 fo fhe analysis in Theorem 1 gives fhe 
provable complexify bound for fhe algorifhm as sfafed in (1). 

As usual, fhe probabilify of success in eifher fhe provable or fhe heurisfic 
version of fhe algorifhm (described below) can be increased arbitrarily high 
by running the same algorithm repeatedly and choosing the most common 
pol 5 momial returned among all runs to be the most likely candidate for /. 
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3.4 Heuristic approach 

The heuristic version of this procedure is simply to choose appropriate con¬ 
stants for the parameters k and £, with the intuition that there can be some 
trade-off between the size of each prime (governed by k) and the number of 
chosen primes (governed by £). Furthermore, the bound on Q required in the¬ 
ory to obtain diversification between the coefficients in / and in any collisions 
is urmecessarily high for most "typical" pol 5 momials. There do exist patholog¬ 
ical counterexamples, but they require the degrees of many terms to be equiv¬ 
alent modulo q. As the prime q is also chosen randomly in our approach, we 
have a good indication that this heuristic approach will work with high prob¬ 
ability for a randomly-chosen sparse pol 5 momial. We state this as a conjecture, 
which is also backed by the experimental evidence reported in Section 5 below. 

Conjecture 3. For any sujficiently large height bound H, and using Q = 2H, there 
exist constants k,£ > 1 such that, for a polynomial f e Z[xi, chosen at 

random with at most T nonzero terms, the probability that algorithm Sparseinterp 
successfully interpolates f is at least yi. 

The heuristic complexity under this conjecture is stated in (2). 

3.5 Example 

Consider an unknown bivariate pol 5 momial f{x,y) with 3 nonzero terms and 
degree less than 10. 

The primes that we are going to use are going to be small for the sake of 
this example. The primes are 7,13, and 17. 

1. Now we compute / modulo {z^ — 1)/ we get: 

/ mod -1) =2z‘^ + 7z^ + 3z^ 

/mod (z^^-l) = 10z^-F2z^ 

/ mod (z^ — 1) = 3z^ -F 7z^ -F 2z 


We notice here that a collision happened with the prime 13. This won't 
affect our calculation later because resulting coefficient 10 doesn't appear 
an 5 rwhere else (For this example, we are going to assume that we have a 
good coefficient if it appears twice or more). 

Now we use these values to fill the list L. Every triple in L consists of the 
coefficient, the prime, and the exponent. 


coefficient 

2 

7 

3 

10 

2 

3 

7 

2 

exponent 

9 

8 

3 

7 

4 

6 

3 

1 

prime 

17 

17 

17 

13 

13 

7 

7 

7 
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We then sort L based on coefficient values. 


coefficient 

10 

7 

7 

3 

3 

2 

2 

2 

exponent 

7 

8 

3 

3 

6 

9 

4 

1 

prime 

13 

17 

7 

17 

7 

17 

13 

7 


We use the Chinese remainder theorem to get back the exponent that 
corresponds to each coefficient. 


ei mod 17 = 8 
e\ mod 7 = 3 


=> Cl = 59 


£2 mod 17 = 3 
£2 mod 7 = 6 


=> £2 = 20 


£3 mod 17 = 9 'j 
£3 mod 13 = 4 > ^ £3 = 43 
£3 mod 7 = 1 J 

This results in the univariate polynomial 

/(z) = 3z^° + 22^3 + 7^59^ 

Finally, inverting the Kronecker map f{z) = f{x, we obtain 
f{x,y) = 3i/^ + Ix^y'^ + 7x^i/^. 
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Procedure Sparselnterp(/, n, T, D, H, k, £) 

Input: Bounds T, D, H for an n-variate sparse pol 5 momial / in n 

variables, with T nonzero terms, partial degrees less than D, and 
integer coefficients less than H in absolute value; and parameters 
k,i,Q G Z>o- 

Output: A list of t < T coefficients c, and exponent tuples (e,i, If 

k, ^ are sufficiently large, these coefficients and exponents 
comprise the sparse representation of / with high probability. 

1 q ■ir- random prime in the range [Q, 2Q] 

1 a -ir- random element of (Z/ qZ)* 

3 (ao/ • • ■ ^^n-l) ^ (a, mod q ,... ,a^" ^ mod q) 
i \ ^kT 

5 }i ^ \{inlgD)/lgX] 

6 L <— thread-safe list of integer triples 

7 for i = 1,2,... ,fi in parallel do 

8 Pi <— random prime in the range [A, 2A] 

9 (Do,.. ^ (1,D mod pi,...,D"~'^ mod p,) 

“ /; <—/(aoz^o^... mod (zP'— 1) G (Z/(jZ)[z] using dense 

pol 5 momial arithmetic 

11 for y = 0 , 1 ,..., p, — 1 do 

12 Cij <— coefficient of z^ in /, 

13 if Cij 7 ^ 0 then Add (c/y,;, p/) to L 

14 Sort L in parallel by the coefficients Cq in each triple 

15 F <— thread-safe list of coefficient/exponent tuples 

16 foreach Unique coefficient c in L in parallel do 

17 {del. Pci)/ • • • / {dcu, Peu) the M > 1 exponents and primes appearing 
with coefficient c in L 

18 if w > p/2 then 

19 u' <— least integer s.t. Y\i<i<u' Pd > D” 

20 Ec <— least integer s.t. mod pd for 0 < i < u' via Chinese 

remaindering 

21 c* G Z/pZ, stored as an integer in the range [—p/2, p/2] 

22 (cj,..., e„) D-adic expansion of Ec 

23 Add c* and (ei,... ,en) to F 

24 Sort F in parallel by the exponents and return the resulting sparse 
pol5momial 





4 Parallel implementation 

We completed a low-level implementation of Procedure Sparseinterp written 
in the C programming language. Our complete source code, as well as the exact 
source we tested for fhe comparisons and benchmark problems lisfed lafer, is 
available upon requesf by email. 

We give a few defails here on fhe choices of our implemenfafion, in parfic- 
ular fhe libraries fhaf were utilized. 

4.1 FLINT for sparse and dense polynomial arithmetic 

The key advanfage fo fhe "small primes" approach which we employed is 
fhe reliance on fasf subroufines for dense pol 5 momial arifhmefic. The experi- 
menfs we ran always used a word-sized modulus q, and so fhe mosf expensive 
compufafions involved computing wifh dense, low-degree pol 5 momials wifh 
word-sized modular coefficienfs. 

FLINT (http://flintlib.org/) is a free, open-source C library for fasf 
number fheorefic compufafions [Harf ef al., 2013]. Our dense pol 5 momial arifh¬ 
mefic, which is fhe dominafing cosf bofh in fheory and in practice in our ex- 
perimenfs, was performed using fhe runod_poly dafa t 5 rpe. 

In order fo sfore fhe result of sparse interpolation and complete the correct¬ 
ness testing in our experiments, we also added rudimentary support for sparse 
infeger polynomials on fop of FLINT. We creafed a new dafa t 5 rpe, f mpz_sparse, 
fo represenf sparse univariafe pol 5 momials in Z [x] for fesfing purposes, using 
flint's multiple precision f 5 q)e f mpz as bofh fhe coefficienf and exponenf sfor- 
age for sparse polynomials. 

Nofe fhaf using mulfiple-precision infegers for exponenfs is especially im- 
porfanf, as we have not yef complefed full mulfivariafe pol 5 momial supporf 
within FLINT. Instead, for the purposes of our experimenfs, we always used 
a sfandard Kronecker subsfifufion fo sfore n-variafe, degree-D mulfivariafe 
pol 5 momials as univariafe sparse pol 5 momials wifh degree bounded by D”. 
Employing a mulfiple-precision dafa t 5 rpe allows for fhe largesf possible de¬ 
gree and number of variables, which is crucial as if is precisely fhis supersparse 
case in which our approach has fhe greafesf pofenfial parallel speedup. 

4.2 MPI for multi-processor parallelism 

Message Passing Inferface (MPI) allows us fo parallelize our algorifhm. As 
sfafed earlier, since fhe slowesf parf of our algorifhm involves calculating fhe 
unknown pol 5 momial / modulo many polynomials {xP' — 1), we use MPI fo 
perform each of fhese evaluations in parallel. 

The function MPI_Init is called af fhe begirming of fhe program fo spawn 
an arbifrary number of processes, as specified on fhe command line. Each of 
fhe allocafed processes will be executing separafely wifh separafe copies of all 
variables in fhe original process. All processes will have unique id numbers. 
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The root process will have id number 0. By knowing processes id's we can 
separate what each process executes. 

We used a master-slave model for our algorithm. The master process evenly 
distributes how many primes each slave calculates. After getting the primes, 
the slave process will compute the following for every prime: fi{x) mod 

(jf! — 1), fhen fraverse fhe resulfing univariafe pol 5 momial and save all nonzero 
ferms, along wifh fhe prime p,, fo an array. This array of friples is senf back fo 
fhe masfer process, which lafer sorfs all fhe concafenafed evaluafion arrays 
and uses Chinese remaindering fo recover fhe full pol 5 momial /, as described 
above. 

While our experimenfs were performed on a mulfi-core machine, and hence 
using a simpler fhreading library would have also worked, our goal in using 
MPI was fo demonsfrafe fhe full parallel pofenfial of fhis approach by explicifly 
defailing fhe infer-process communication. Furfhermore, fhe MPI implemen- 
fafion could also be used wifhouf modification on heferogeneous clusfers or 
ofher archifecfures besides mulfi-core. 


5 Experimental results 

We ran our fesfs on a machine using an Infel(R) Core(TM) i7-3930K CPU @ 
3.20GHz simulating 12 h 5 rper-fhreaded cores on 6 physical cores, wifh 32 GB 
of RAM. We used a Debian GNU sysfem, rurming fhe "unsfable" branch, wifh 
Linux kernel version 3.16.0-4-amd64. This is a bleeding-edge sysfem wifh fhe 
mosf currenf versions of all software available wifhin fhe Debian repositories. 

5.1 Determining the parameters k and i 

The firsf fask was fo determine experimenfally whaf kind of seffings for fhe pa¬ 
rameters k and i would be appropriafe for our heurisfic inferpolafion mefhod. 
We found fhaf k = 38 and 1 = 2 worked for a wide range of problem sizes wifh 
almosf no failures in fhe randomized algorifhm. 

The only exceptions we found here were fhaf when fhe bound on fhe num¬ 
ber of ferms T in fhe outpuf was very small, even setting k = 38 fhe number of 
primes p in fhe range [A, 2A] was simply nof sufficienf fo ensure a high success 
probabilify. In fhese small-size exfremes, fhe value of k was increased fo ac- 
commodafe; in parficular, we seffled on A: = 50 and £ = 2 when T < 1000, and 
when T < 100 we had fo selecf k > 10000/T. Under fhese paramefer seffings, 
no failures were observed in any of our experimenfs. 

We commenf fhaf a smaller k value and a larger £ value would be preferable, 
because fhaf would increase fhe number of primes p and hence fhe pofenfial 
parallel speedup for fhe algorifhm, while decreasing fhe size A of each prime p,. 
However, we found fhaf modesfly larger values for £ did nof allow for k fo be 
reduced and consisfenfly resulf in correcf outpuf. We consider fhe exploration 
of some better balance between fhe parameters k and £ as fufure work. 
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ni 

variables 

terms 

max-degree 

F 

A 

Mathemagix 

Ours (single thread) 

Ours (12 threads) 

1 

20 

3 

40 

14 

50000 

0.078 

0.035 

0.029 

2 

20 

9 

80 

18 

5000 

0.155 

0.151 

0.048 

3 

20 

27 

120 

18 

6 900 

0.305 

0.329 

0.116 

4 

20 

81 

160 

18 

4900 

0.598 

0.323 

0.085 

5 

20 

243 

200 

17 

9900 

2.156 

0.785 

0.175 

6 

20 

729 

240 

15 

39900 

5.053 

3.084 

0.814 

7 

20 

2187 

280 

14 

80050 

13.333 

8.714 

2.225 

8 

20 

6561 

320 

13 

321300 

41.070 

43.911 

10.605 


Table 1: Sparse interpolation benchmark times (in seconds) 


5.2 Single-threaded performance 

The first experiment was to test our algorithm without parallelization against 
the efficient "big primes" implementation in Mathemagix (http: //www. mathemagix. 
org/), as reported in [van der Hoeven and Lecerf, 2015]. We downloaded the 
source code for the Mathemagix sparse interpolation program, then ran it with 
m = 1,2,3,4,5,6,7,8 pol 5 momials being multiplied. We kept each pol 5 momial 
at 20 variables, with degree 40, 3 terms and coefficients up to 2^^. Then we ran 
our algorithm with the same parameters. The results are shown in Taible 1 and 
summarized in Figure 1. (We also attempted a comparison against the sinterp 
function in Maple 2015, but it was more than an order of magnifude slower in 
all experimenfs.) 

Our non-parallelized algorifhm seems fo be on par wifh fhe Mafhemagix 
implemenfafion for fhis range of problem sizes. As a comparison, and fo em¬ 
phasize fhe main poinf of our paper, we also show fhe full parallel speedup for 
fhe same problems in Figure 1. 

5.3 Parallel speedup 

The second experiment was to test the parallel speedup for our implemenfafion 
by varying fhe degree size D and leaving k = 38, I = 2, partial /,, and m = 

6 constant. D was varied from 20^^, 20^^, 20^*^*^. This fesf was performed 3 
separafe times and fhe resulfant dafa was calculafed from median of fhe fhree 
frials. The resulfs are seen in Figure 2. 

Figure 2 shows a linear speedup increase as fhe number of parallel pro¬ 
cesses used gefs closer fo fhe physical number of cores on fhe machine. Addi¬ 
tionally, we can see fhe mosf significanf parallel speedup occurs for fhe highesf 
degree fesfed. Recall fhaf on our machine, fhere are only 6 physical cores fhaf 
are h 5 rper-fhreaded fo 12 virfual cores. Furfhermore, when running only six 
fhreads, fhe "furbo mode" clock rafe is increased fo 3.8GHz. This may help fo 
explain fhe dip in performance seen around m = 6 parallel processes. 

Observe also fhat fhe parallel speedup is besf, and continues fhe furthesf, 
in fhe mosf extremely sparse case with very high degree. 

Our parallel speedup is demonstrated in Figure 2. 


15 






Run-Time 



m (number of polynomials) 


Figure 1: Comparison With Mathemagix Sparse Interpolation Program 
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Speedup (singleProcess/parallelProcess in Seconds) 



Figure 2: Parallel speedup for our implementation 
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6 Conclusions and future work 


We have shown that the "small primes" sparse interpolation algorithm is com¬ 
petitive with the state of the art, even without parallelization, especially for 
very sparse problem insfances. Furfhermore, fhere is greafer pofenfial for par¬ 
allelism in the small primes technique. These theoretical results are home out 
in practice in our experimental results compared to other available software 
implementations. 

There is significantly more work to be done, however, before we mighf sug- 
gesf widespread adopfion of our heurisfic sparse inferpolafion mefhod. We 
would like fo undersfand fhe fheory behind fhe heurisfic approach in order fo 
have a less ad hoc way of defermining fhe paramefers k and 1. On fhe ofher 
hand, our implemenfation could be greatly enhanced with further experimen¬ 
tation on a wider range of benchmark problems and incorporafing frue multi- 
variafe sparse pol 5 momial represenfafions. 
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